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ABSTRACT 

Subspace  iteration  is  a  reliable  and  cost  effective  method  for  solving  positive  definite 
banded  symmetric  generalized  eigenproblems,  especially  in  the  case  of  large  scale  problems. 
This  paper  discusses  an  algorithm  that  makes  use  of  two  parallel  banded  solvers  in  subspace 
iteration.  A  shift  is  introduced  to  decompose  the  banded  linear  systems  into  relatively 
independent  subsystems  and  to  accelerate  the  iterations.  With  this  shift,  an  eigenproblem 
is  mapped  efficiently  into  the  memories  of  a  multiprocessor  and  a  high  speed-up  is  obtained 
for  parallel  implementations.  An  optimal  shift  is  a  shift  that  balances  total  computation  and 
communication  costs.  Under  certain  conditions,  we  show  how  to  estimate  an  optimal  shift 
analytically  using  the  decay  rate  for  the  inverse  of  a  banded  matrix,  and  how  to  improve  this 
estimate.  Computational  results  on  iPSC72  and  iPSC/860  iw 
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1  Introduction 


Eigenvalue  problems  arise  in  many  areas  of  physics  and  engineering  such  as  the  analysis  of 
electron  orbits  in  atoms  and  the  stability  of  structures.  Due  to  the  large  number  of  applica¬ 
tions  of  such  problems,  there  is  a  constant  demand  for  algorithms  for  computing  eigenvalues. 
The  development  of  efficient  algorithms  has  received  considerable  attention  in  the  litera¬ 
ture  [1]  [2]  [23]  [29].  With  the  increased  use  of  advanced  computers,  parallel  algorithms 
are  also  becoming  available  [11]  [17]  [19]  [20]  [22]  [24]  [28].  Most  of  these  algorithms  begin 
by  reduction  of  the  problem  to  a  standard  form.  This  is  particularly  true  for  generalized 
eigenproblems  [5]. 

This  paper  presents  two  parallel  banded  linear  solvers  and  their  application  for  general¬ 
ized  positive  definite  eigenvalue  problems,  in  which,  only  a  few  of  the  smallest  eigenvalues 
and  corresponding  eigenvectors  are  needed  to  moderate  accuracy.  This  type  of  problem  arises 
in  structural  analysis  and  other  engineering  fields  [2]  [15].  Among  all  solution  methods,  two 
families  are  most  popular:  subspace  iteration  [2]  [4]  [25]  [26]  and  the  Lanczos  method  [12] 
[16].  The  Lanczos  method  has  been  shown  to  be  superior  to  subspace  iteration  on  sequential 
and  vector  machines  [15]  [21].  The  comparison  of  the  two  families  on  a  parallel  computer 
is  relatively  unknown.  Though  the  Lanczos  method  is  strongly  favored  by  mathematicians, 
subspace  iteration  is  more  often  used  in  engineering,  particularly  in  structural  analysis.  This 
may  be  because  subspace  iteration  is  conceptually  simple  and  closely  associated  with  sub¬ 
structure,  based  on  which,  one  can  often  construct  an  approximate  subspace  from  experience. 
Many  software  codes  are  still  using  subspace  iteration  for  computing  a  few  dominant  eigen¬ 
values  and  corresponding  eigenvectors.  The  major  effort  in  subspace  iteration  is  devoted  to 
solving  banded  linear  systems.  This  paper  discusses  an  algorithm  that  makes  use  of  two 
parallel  banded  linear  solvers  in  subspace  iteration.  The  main  feature  of  this  algorithm  is 
the  introduction  of  a  shift  that  decomposes  the  banded  linear  system  into  relatively  inde¬ 
pendent  subsystems  and  accelerates  the  subspace  iteration.  We  shall  show  when  this  shift 
is  applicable,  how  to  estimate  this  shift  analytically  using  the  decay  rate  for  the  inverse  of 
a  banded  matrix,  and  how  to  improve  this  estimate.  The  comparison  of  subspace  iteration 
with  the  Lanczos  method  is  beyond  the  consideration  of  this  paper. 

This  paper  is  organized  as  follows.  In  Section  2  two  parallel  tridiagoiial  solvers  [27]  are 
extended  to  banded  solvers.  The  second  of  these  makes  use  of  the  decay  of  the  inverse 
of  banded  matrices.  Theoretical  and  numerical  results  are  presented  for  the  comparison 
of  efficiencies.  In  Section  3,  the  parallel  subspace  iteration  algorithm  for  the  generalized 
eigenvalue  problem  is  described.  A  shift  is  introduced  and  analyzed,  and  computational 
results  are  presented.  Section  4  concludes  by  pointing  out  advantages  and  limitations  of  our 
algorithm. 

2  Parallel  banded  solvers 

The  Parallel  Partition  HI  (PPT)  algorithm  and  the  Parallel  Diagonal  Dominant  (PDD) 
algorithm  were  proposed  for  solving  tridiagonal  linear  .systems  on  multicomputers  in  [27]. 
Here  we  extend  them  to  banded  linear  systems.  The  PPT  algorithm  is  similar  to  an  algorithm 


introduced  by  Lawrie  and  Sameh  in  [18].  The  PDD  algorithm  is  a  variant  of  PPT  which 
uses  the  fact  that  the  entries  in  the  inverse  of  a  banded  matrix  decay  away  from  the  main 
diagonal. 

Let  us  consider  a  system  of  order  n 

Ax  =  d  ( 1 ) 

where  is  a  nonsingular  banded  matrix  with  lower  band  width  m;  and  upper  band  width  m^; 
specifically,  =  0  if  f  —  j  >  m/  or  j  —i  >  »«„.  Let  p  denote  the  number  of  processors  used, 
and  for  convenience  assume  that  n  =  and  that  the  half  band  width  m  :=  max(m/,mu)  < 
7(5/2.  Following  [18]  we  partition  as  a  block  p  x  p  matrix  with  blocks  of  order  U5: 

.42  0-2 

A  =  B3  y43 


Let  A  =  diag  [>fi,  >l2i  •  •  • »  ^p]  and  write  A  =  A  +  Ay4.  The  main  assumption  of  the  PPT 
and  PDD  algorithms  is  that  both  A  and  A  arc  nonsingular.  The  n*  x  n,  submatrices  Bi  and 
C,  have  the  form 


0  Ri 
0  0 


0  0 
Li  0 


where  Ri  is  m/  x  m;  upper  triangular  and  L,  is  m„  x  rn^  lower  triangular. 
Let  V  and  E  be  block  p  x  (p  —  1 )  matrices  of  the  form 


•Cl 

B2  C2 


F 

G  F 
G 


with  Us  X  (mi  +  mu)  blocks 


0  0  ■ 

0  L.J’ 


Ri  0 

0  0,  ’ 


0  0 


0  Im. 
0  0 


where  Ik  denotes  the  identity  matrix  of  order  k.  Set  ;=  [mi  +  mu)  (p  —  1).  Then  V  and 
E  are  n  x  n,  matrices  and  Ay4  =  VE^ .  Next,  choose  any  n  x  [n  —  Ur)  matrix  H  so  that 
P  =  [E\H]  is  an  n  X  n  permutation  matrix.  Then  E^ E  =  H  —  In-nri  E^ H  =  0, 

and  E  =  0,  and  it  follows  that 


A~^AP  —  \  ^  ^ 


where  Z  =  /„,.  +  E^A  ’  V.  Consequently,  if  A  and  A  are  nonsingular,  so  is  Z. 


Now  using  the  Sherman- Morrison- Woodbury  formula  [13],  the  solution  of  equation  (1) 
can  be  expressed  as 

X  =  A-^d  =  iA  +  VE'^)-U 

=  A-^d  -  A-WiLr  +  E'^A-^V)-^E'^A-U. 

Using  this  representation  equation  (1)  can  be  solved  in  the  following  steps: 

1.  Solve  i(x,  K)  =  (d,  V) 

2.  Form  Z  =  /„,  -I-  E^Y  and  h  =  E^x,  then  solve 

Zy  =  h. 

3.  Calculate  x  =  x  —  Yy. 


The  matrix  Z  in  Step  2  is  called  the  reduced  matrix,  and  has  the  form 


fmt 

Z12 

Z21 

Z24 

Z31 

■fmj 

Z34 

Z43 

Anu 

Z53 

•  . 

Z2p-4,2p-2 

Imi  Z2p—3,2p—2 

^2p— 2,2p— 3  fmu 

The  matrix  Z  can  be  reformulated  as  a  tridiagonal  block  matrix  by  the  permutation  of  the 
(2i  —  l)-th  and  the  2t-th  column  blocks.  Therefore  the  reduced  system  can  be  treated  as  a 
banded  system  with  the  upper  and  lower  band  widths  of  m/  -f  —  1 .  We  reiterate  that  the 
above  matrix  partitioning  scheme  is  essentially  the  same  as  that  proposed  in  [18]. 

We  now  outline  the  PPT  algorithm. 


Parallel  Partition  LU  (PPT)  Algorithm: 

•  Input:  A,  d 

•  Output:  X 

•  In  parallel,  do  on  all  processors  Pi,  i  =  1, . . .  ,p: 


1.  Solve  Aiixi,Y)  =  idi,Vi),  where  V  =  [l/„...,Up]^  Y  =  [Yt,...,Ypf,  with  V, 


and  Yi  ria  x  blocks,  and  d  =  [di, . . . ,  dp]^,  x  =  [ii,...,Xp]^  with  dj  and  i, 
n,-vectors.  Stop  if  Ai  is  algorithmically  singular. 

2.  Following  an  all-to-all  communication,  form  the  reduced  system  Zy  =  h. 
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3.  Solve  the  reduced  system. 

4.  Calculate  the  i— th  block  of  the  solution  x:  x,  —  it  —  Ky- 


The  total  number  of  data  transferred  is  (m/ +  rriu)  (m/ +  r/iu -t-  l)(p—  1)  (see  [27j).  The 
number  of  flops  is  roughly  (Sniii^  +  ^nm)jp-i  16(p  —  l)m'^  with  m  =  max(m;,mu),  the  half 
band  width. 

As  indicated  in  [27],  the  PPT  algorithm  performs  well  for  narrow  banded  systems  when 
the  number  of  processors  is  small  (say  p  <  16).  However,  the  efficiency  of  the  algorithm 
decreases  quickly  as  either  the  number  of  processors  p  or  the  half  band  width  Jii  increases, 
since  Steps  2  and  3  of  the  algorithm  are  a  bottleneck  in  both  computation  and  data  com¬ 
munication.  This  is  true  for  all  known  variants  of  the  matrix  tearing  techniciue  used  here  [9] 
[10].  Under  certain  conditions  this  bottleneck  can  be  removed.  As  was  already  mentioned 
above,  the  entries  of  the  inverse  of  a  nonsingular,  banded  matrix  decay  away  from  the  main 
diagonal  (see  [6]).  When  the  decay  in  the  inverse  of  A  is  sufficiently  rapid,  the  reduced 
matrix  Z  can  be  approximated  by  a  block  diagonal  matrix  Z  with  blocks  of  order  nii  -f 
The  PDD  algorithm  is  the  PPT  algorithm  with  Z  replaced  by  Z.  When  the  PDD  algorithm 
can  be  iised,  it  is  much  more  efficient  than  the  PPT  algorithm. 

We  now  analyze  the  conditions  under  which  the  PDD  algorithm  is  applicable.  For  sim¬ 
plicity,  we  assume  that  the  half  band  width  w  =  nii  =  We  will  u.se  the  notation 

'4  (U  :  ji  :  ji)  to  denote  the  (ij  —  U  -f  1)  x  —  ji  +  1 )  submatrix  of  A  with  upper  left 
corner  at  entry  and  lower  right  corner  at  entry  (i-ijz)-  Also,  let  ||  1]  denote  the 

oo-norm. 

For  /  =  1, ...  ,p  -  2  the  off-diagonal  blocks  Z2,+i,zi-\  and  Zzi'zi+z  aro  obtained  from 


and 

and  hence 

and 


■^2., 21+2  =  (1  :  uKiis  -  rn  -f  1  :  n  j  L,+x 

^2i+i,2.-i  =  {jis  -  m  -f  1  :  Jis,  1  :  m) 

11^21, 2.+2II  <  ||Ar+,  (1  :  771, Us  -  m  +  1  :  H 


II ■^21+1, 2.- ill  <  ||A,+,  {7is  -771+  \  :7i„\  :  7n)ll||/?,+i|l. 

Using  Proposition  2.3  of  [6],  it  can  be  shown  that  there  are  constants  and  q,  ('  >  0  and 
0  <  <7  <  1,  so  that 


max 

l<i<p-2 


777,  7ls  -  777  +  I  :  77^)]],  (77^  -  777  -f  1  :  77,, 


»0ll}  < 


where  dis  =  77,  —  2777  +  1  is  the  shortest  distance  from  the  main  diagonal  to  an  entry  of 
A~^x  (^  •  —  777  +  1  :  77,)  or  A~^x  (77,  —  777  -f  1  :  77,,  1  :  777).  If  A  is  positive  definite,  dis 

can  be  replaced  by  'Idis.  When  77,/777  ^  1,  these  off-diagonal  blocks  can  be  neglected  and 
the  reduced  matrix  Z  can  be  replaced  by  the  block  diagonal  approximation  Z.  Our  test 
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for  applicability  of  the  PDD  algorithm  is  as  follows.  Let  u  denote  machine  precision  (unit 
roundoff),  and  let  c  denote  a  small  positive  constant.  We  use  PDD  if 


maXi<,<p_2  ||■^2t+1.2i-l||} 

ll^ll 

This  means  that  PDD  is  used  only  if  the  perturbation  in  Z  caused  by  zeroing  the  off-diagonal 
blocks  is  comparable  to  the  roundoff  error  introduced  by  solving  the  reduced  system  by  a 
numerically  stable  method.  Equation  (2)  will  be  referred  as  test  (2)  in  the  following  sections. 
We  now  outline  the  PDD  algorithm. 


<  cu. 


(2) 


Parallel  Diagonal  Dominant  (PDD)  Algorithm; 

•  Input:  m-banded  A,  d 

•  Output:  Xdd  (an  approximate  solution  of  Ax  =  d) 

•  In  parallel,  do  on  all  processors  Pi,  i  =  I,. . .  ,p: 


1.  Solve  yl,(i,,  V;)  -  (d,,  V^). 

2.  Send  ^^(1  :  7ii,2rti{i  —  2)  +  I  :2m(i  —  2)  +  m)  and  x,(l  :  m)  to  the  processor 
Pi_i  (<  /  1).  Form  the  i-th  block  of  matrix  Z  {i  ^  p): 

[An  ^2»-l,2il 
^2i  2i— 1  fm  J 

3.  Solve 


An  2^2i-l,2i 
An 


y  (2m  (i  —  1)  -|-  1  :  2mi)  =  h  (2m  (i  —  1 )  m  +  1  :  277ii) 


on  Pi  (^  ^  p),  then  send  y(7n(2i  —  1)  1  ;  277ii)  to  the  processor  P.+, . 

4.  Calculate  the  i-th  block  of  x^j:  (xjj)i  =  Xi  —  Yiy. 


Note  that  the  system  at  Step  3  can  be  treated  as  a  banded  system  with  half  band 
width  771  if  a  permutation  of  column  blocks  is  used.  The  total  number  of  flops  is  roughly 
[SimP  -)-  5nm)/p  and  the  number  of  data  transferred  is  uP  -f-  2/7?. 

Our  Hybrid  Algorithm  is  simply:  use  PDD  if  applicable;  otherwise  use  PPT. 

The  PPT  and  the  PDD  algorithms  have  been  implemented  on  a  64-node  NCUBE-1 
computer  for  tridiagonal  linear  systems  [27].  We  present  in  Tables  1  and  2  the  numerical 
results  of  the  two  algorithms  for  banded  matrices  on  a  16-node  Intel  iPS(72  computer.  The 
testing  banded  matrices  are  of  the  form  A  =  with 


(l{j  —  \ 


2771  -f-  S 
0 


if  0  <  I*  —  il  <  m 
if  i  =  j 

otherwise 


Table  1: 


m 

n 

P 

Execution  Time  (seconds) 

Ratio 

PPT 

.SPT 

PDD 

SDD 

SDD/PDD 

10 

640 

1 

■H 

■■■ 

2 

4,811 

2.598 

4.768 

1.83 

(s=69) 

4 

6.429 

1.887 

6.144 

2.89 

8 

1.907 

7.434 

0.922 

3.90 

■01 

3200 

1 

HI 

2 

13.186 

24.231 

13.160 

24.288 

1.84 

1.85 

1.00 

4 

32.117 

9.741 

31.800 

3.18 

3.26 

1.04 

Wgm 

8 

.36.010 

4.816 

35.160 

6.21 

7.30 

1.20 

16 

4,643 

.38.109 

2..370 

36.390 

8.21 

15.35 

1.96 

20 

3200 

1 

2 

43.678 

78.893 

43.514 

78.979 

1.82 

1.00 

(s=12) 

4 

.34.774 

106.706 

32.629 

104.976 

3.22 

1.07 

8 

121.335 

16.106 

116.594 

7.24 

1.39 

16 

22.319 

7.771 

118.908 

HUH 

15.30 

2.87 

where  5  is  chosen  so  that  the  ofF-diagonal  blocks  in  the  reduced  matrix  Z  can  be  neglected. 
In  our  numerical  experiments,  we  have  used  Gauss  elimination  with  partial  pivoting  to  solve 
the  reduced  system  and  a  value  c  =  10  for  the  test  in  equation  (2).  In  our  experiments,  the 
accuracy  of  the  PPT  and  the  PDD  algorithms  is  comparable  to  working  precision. 

Table  1  shows  the  execution  time  for  different  algorithms  and  the  relative  speedups.  SPT 
and  SDD  are  the  sequential  algorithms  for  PPT  and  PDD.  They  are  executed  on  a  single 
processor.  We  use  them  to  measure  the  relative  speedup  of  the  PPT  and  PDD  algorithms. 
In  our  tables,  n  is  the  order  of  the  test  matrix  and  p  is  the  number  of  processors  being  used 
for  all  the  algorithms  except  SPT  and  SDD.  For  SPT  and  SDD,  p  is  the  number  of  blocks 
in  the  partitioning  scheme. 

As  already  indicated,  the  solution  of  the  reduced  system  Zy  =  /i  is  a  computation  and 
communication  bottleneck  of  the  PPT  algorithm.  Table  2  contains,  for  each  algorithm, 
columns  for  the  percentage  of  time  spent  on  communication  (comm.  %  )  and  for  the  time 
spent  on  computation  (comp.  %  )  at  Step  2  and  3.  It  is  clear  from  both  theoretical  and 
experimental  results  that  the  PDD  algorithm  is  much  more  efficient  when  it  is  applicable. 

3  A  Parallel  Algorithm  for  the  Banded  Generalized 
Eigenvalue  Problem 

In  this  section,  we  consider  generalized  symmetric  eigenproblems  of  the  form 

Ax  =  \Bx,  (3) 

where  A  and  B  are  symmetric,  m-banded  matrices  and  B  is  positive  definite.  We  wish  to 
approximate  the  q,  q  n,  smallest  eigenvalues  Ai  <  Aj  <  . . .  <  A,  and  corresponding 
eigenvectors  Xi,...,a;,.  This  type  of  problem  frequently  arises  in  vibration  mode  analysis 


6 


[2]  [3],  [15],  where  A  and  B  are,  respectively,  the  stiffness  matrix  and  mass  matrix.  The 
eigenvalues  A,  are  the  squares  of  the  the  free  vibration  frequencies  and  the  eigenvectors  x, 
are  the  corresponding  mode  shape  vectors. 

Many  sequential  algorithms  for  solving  the  symmetric  eigenvalue  problem  have  been 
developed  including  those  in  [1].  In  recent  years  parallel  algorithms  for  advanced  computers 
have  appeared  such  as  those  found  in  [11]  [15]  [17]  [19]  [20]  [22]  [24].  Most  of  these  algorithms 
require  reduction  to  tridiagonal  form  for  an  equivalent  standard  problem,  or  computation  of 
all  eigenpairs.  The  algorithm  proposed  in  this  paper  is  a  parallel  subspace  iteration  algorithm 
for  finding  the  q  smallest  eigenvalues  and  corresponding  eigenvectors  that  takes  advantage  of 
the  assumption  that  A  and  B  are  banded.  Subspace  iteration  [2]  [4]  [23]  [25]  [26]  is  a  reliable 
and  cost  effective  method  for  solving  the  eigenvalue  problem  considered  here  to  moderate 
accuracy.  It  has  been  used  extensively  in  a  number  of  general  purpose  finite  element  analysis 
programs  [2]. 

Using  a  shift  s,  a  symmetric  matrix  pencil  {/I,  B),  with  B  positive  definite,  can  be 
transformed  to  a  positive  definite  matrix  pencil  {A  —  sB,  B).  This  shift  leaves  eigenspaces 
unchanged.  For  the  moment  let  us  assume  that  the  pencil  (^4,  B)  is  positive  definite. 

The  basic  subspace  iteration  algorithm  then  consists  of  the  following  steps: 

•  Establish  a  starting  subspace  of  dimension  q  spanned  by  the  columns  of  5^°^  where 
q  >  q  and  q  is  the  number  of  eigenpairs  to  be  calculated. 

•  For  =  0, 1, . . . ,  until  the  first  q  eigenvalues  satisfy  a  stopping  criterion. 

1.  apply  the  Rayleigh-Ritz  (RR)  procedure  to  extract  the  “best”  eigenvalue  and 
eigenvector  approximations  from 

2.  improve  by  an  inverse  iteration 

yl5:(fc+i)  ^  Qs(k) 

The  starting  subspace  can  be  generated  2is  discussed  in  [2]  or  can  be  generated  randomly. 
The  user  can  input  the  value  of  the  default  value  is  ^  =  min{2(/,q'  +  8}  based  on  the 


7 


numerical  experiments  found  in  [2].  The  details  of  the  RR  procedure  are  described  in  the 
RR  algorithm  : 


RR  Algorithm: 

•  Input:  S 

•  Output:  eigenpair  approximations  for  Ax  =  \Bx  from  S 

1.  Orthonormalize  the  columns  of  S  with  respect  to  B  to  obtain  Q  written  over  S, 
i.e.  Q'^BQ  =  I. 

2.  Form  the  Rayleigh  Quotient:  Ha  =  Q^AQ. 

3.  Find  eigenpairs  of  Ha-  j  —  1?  •  •  •  >  9- 

4.  Form  the  “best”  eigenpair  approximations  from  S:  {9j,Xj),  with  Xj  =  Q<i>j,  j  = 
1, •••,9- 


Let 


n{k+i)  Mk) 

Jk)  _  , 


(4) 


Since  lim*;_oo  the  efficiency  of  the  subspace  iteration  method  depends 

on  A,/A,-+i,  the  ratio  of  the  largest  desired  eigenvalue  A,  and  the  eigenvalue  A^+i  for  the 
pencil  (/4,  B).  If  this  ratio  is  small,  as  in  inverse  iteration  with  a  good  shift,  only  a  few 
iterations  will  be  needed.  In  this  case,  the  fact  that  the  method  is  simple,  does  not  require 
reduction  to  tridiagonal  form,  and  economizes  on  data  movement  from  memory,  make  it  an 
ideal  algorithm,  especially  for  high-performance  computers.  We  note  that  the  most  operation 
intensive  parts  of  the  algorithm  are  the  first  two  steps  of  the  RR  procedure  and  the  solution 
of  the  linear  systems.  The  computation  cost  for  the  eigenpairs  of  Ha  can  be  ignored  since 
q  n.  The  RR  procedure  can  be  implemented  efficiently  on  a  variety  of  architectures  using 
computationally  primitive  BLAS  [1]  [7].  The  linear  systems  involved  are  banded  and  positive 
definite. 

For  our  parallel  subspace  iteration  algorithm,  we  divide  the  banded  pencil  {A,  B)  into  p 
blocks,  with  n,  rows  per  block. 


{A,B)  = 


All  B\ 

A2,  B2 


Ap,  Bp 


and  assign  one  block  to  each  processor.  Our  algorithm  is  outlined  as  follows. 


Parallel  Subspace  Iteration  Algorithm: 
•  Input:  (A,  B),  tol 


(5) 
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•  Output:  (Aj,Xi),  i  =  1, . . .  ,q,  first  9  eigenpairs  of  (>1,  B) 


•  In  parallel,  do  on  all  processors  Pj,  j  =  1, . . .  ,p: 


1.  Calculate  concurrently  a  shift  s,  then  shift  (/I,  B)  to  B)  with  /f*  =  A  —  sB. 

2.  Generate  the  j— th  block  of  a  starting  matrix  where  is  an  n,  x  q 
matrix. 


3.  For  A:  =  0, 1, , 

—  Apply  the  RR  procedure  concurrently. 

-  Solve  by  the  PPT  or  PDD  algorithm  until  the  computed 

eigenvalues  of  (Aj,  B)  satisfy 


i^p)  _  ^(*-'>1 

1«!‘' + s| 


<  tol,i  =  1, ...  ,9. 


4.  Set  A,-  :=  +  s,  Xi  :=  x\'^\  i  =  1, . . . ,( 


The  selection  of  the  shift  s  at  Step  1  is  critical  to  the  reliability  and  efficiency  of  the 
algorithm.  There  are,  however,  several  competing  requirements  placed  on  the  shift  s.  First, 
for  the  convergence  and  stability  of  the  algorithm,  the  shift  s  should  satisfy 


max  lAi  —  s|  <  max  |A'  —  s|, 

i<j<9  ^  j'>g  ^ 

and  .s  should  not  be  an  eigenvalue  or  too  close  to  an  eigenvalue  of  (A,B)  [2].  Second,  in 
order  to  reduce  the  number  of  iterations,  .s  should  be  close  to  Aj,  j  =  1,...,^,  since  the 
rate  of  convergence  of  is  (Aj  —  .s)^/(A^+i  —  s)^.  Finally,  to  obtain  the  greatest  efficiency 
the  shift  s  should  be  chosen,  whenever  possible,  so  that  the  PDD  algorithm  can  be  used  to 
solve  the  systems  As.S'b+D  =  BS^'\  because  the  PDD  algorithm,  among  all  parallel  banded 
solvers,  is  one  of  the  most  efficient  parallel  algorithms.  If  it  is  not  possible  to  find  a  shift 
.s  so  that  As  satisfies  test  (2),  then  the  PPT  algorithm  must  be  used  to  solve  this  system. 
Based  on  these  requirements,  we  use  the  following  process  to  select  a  shift  s. 

Shift  Selection  Strategy: 

1.  Approximate  the  smallest  eigenvalue  Ai  of  (A,  B),  then  shift  (A,  B)  to  (A^, ,  B),  where 
As,  —  A  —  Si  B  with 

5]  =  Ai  —  ^  for  a  6  >  0 

and  8  is  made  small  compared  to  A].  Since  the  smallest  eigenvalue  of  (As, ,  B)  is  <5,  the 
pencil  (As,  ,B)  continues  to  be  positive  definite,  but  has  a  small  extreme  eigenvalue. 
This  guarantees  the  convergence  and  stability  of  the  iterations,  and  also  minimizes  the 
number  of  iterations.  Find  a  positive  lower  bound  for  the  smallest  eigenvalue  of  A. 
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2.  Compute  the  off-diagonal  blocks  of  the  reduced  matrix  for  and  the  timing  ratio  for 
the  PPT  and  PDD  algorithms 


^  time  — 


execution  time  of  PPT 
execution  time  of  PDD 


The  ratio  rtime  is  a  function  of  the  matrix  parameters  n,  m,  p,  and  Tcomp/Tcomm,  where  p 
is  the  number  of  processors  used,  and  Tcomp  and  Tcomm  are  the  speeds  for  floating  point 
computation  and  data  communication,  can  be  estimated  from  the  complexities  of 
the  PPT  and  PDD  algorithms  given  in  Section  2  or  measured  by  numerical  experiments. 

3.  If  test  (2)  is  satisfied  by  ^4^,  or  rti,ne  ~  1,  then  set  s  :=  Sj,  and  stop. 

4.  If  A  passes  test  (2),  apply  the  Shift  Refinement  algorithm  (below)  to  (/4s,,  B),  and 
e  =  Si,  and  set  S2  —  t,  s  =  Si  —  S2,  and  stop.  Here  t  denotes  the  output  of  the  Shift 
Refinement  algorithm. 

5.  Apply  test  (2)  to  B.  If  B  does  not  pass  test  (2),  then  set  S2  :=  0,  s  ;=  5],  and  stop. 

6.  Attempt  to  compute  a  positive  4  so  that  A  +  C,B  satisfies  test  (2).  If  such  an  C  is 
found,  apply  the  Shift  Refinement  algorithm  to  (A,  B)  and  e  =  (.  Set  S2  :=  -f  t  and 
s  :=  Si  —  S2;  otherwise  set  S2  =  0  and  s  :=  si. 

In  the  following,  we  give  the  details  of  Steps  1  and  6. 

Computation  of  si: 

The  inverse  iteration: 

AxW  = 


= 


k  =  1,2, ... , 


where  is  a  normalization  constant,  is  used  to  obtain  an  approximation  of  Ai.  The 
iteration  {(Ai*'^  converges  to  the  first  eigenpair  of  (A,  B)  since  the  matrix  pencil  (A,  B) 
is  positive  definite.  Few  iterations  are  needed  at  this  step  because  only  a  crude  approximation 
of  Ai  is  necessary.  In  our  experiments,  the  linear  systems  Ax^^^  =  are  solved  by 

the  Hybrid  algorithm  introduced  in  Section  2.  We  note  that  matrix  factorization  is  needed 
only  at  the  first  iteration.  The  iterations  are  terminated  when 

“  ''1  I  ^  ir.-2  /"7\ 


and  S]  is  chosen  a.s  si  =  O.QSAj*'^  which  implies  that  S  w  0.04Ai.  Similarly,  we  find  a  positive 
lower  bound  for  the  smallest  eigenvalue  of  A  for  use  in  Step  6.  The  quantity  8{A)  required 
in  Step  4  of  the  Shift  Selection  Strategy  is  a  by-product  of  the  inverse  iteration  in  Step  1. 

Computation  of  52: 

This  shift  is  introduced  only  when  A*,  has  failed  to  pass  test  (2)  and  rtime  ^  1-  This  is  the 
situation  when  the  PDD  algorithm  can  not  be  applied  to  the  systems  and 

the  PPT  is  much  more  expensive  than  the  PDD  algorithm  (most  parallel  banded  solvers  [9] 
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[10]  [14]  [18]  have  roughly  similar  computation  and  communication  complexities  as  those  of 
PPT).  As  indicated  by  the  theoretical  and  experimental  results  of  Section  2,  rt,me  increases 
significantly  when  the  number  of  processors  or  the  band  width  increases.  If  shift  s-^  can 
be  found,  it  can  be  used  to  remove  the  bottleneck  of  the  PPT  algorithm.  However,  while 
this  shift  reduces  execution  time  for  each  single  iteration,  it  increases  the  total  number  of 
iterations  at  the  same  time.  Whether  this  shift  can  and  should  be  performed  depends  on 
the  eigenproblem  (3)  and  the  machine  architecture.  We  now  describe  Step  6. 

For  a  positive  definite  m-banded  matrix  M  it  is  shown  in  [6]  that 


/y/F-  ly^”^  ^  XmaAM) 

\\/F  +1/ 


where  Xmax{^)  and  Xmin{M)  denote  the  largest  and  smallect  eigenvalues  of  M.  If  Mi  is  a 
diagonal  block  of  M,  then  it  follows  that  7(A/,)  <  7(A/).  Applying  test  (2)  to  M ,  we  find 
that 

S{M)  < 

where  C  is  a  constant  which  depends  on  M.  We  will  use  the  heuristic 

6{A  +  (B)  «  C'(7(A  +  CB)f^ 

and  assume  that  the  conc.tant  C  does  not  depend  on  We  determine  it  by  setting  C  = 

i(.4)/(7(A))‘'“. 

We  use  inverse  iteration  to  find  0  <  cq  <  Xmin{B)  and  0  <  do  <  ATOtn(^)>  and  we  use 
Gershgorin’s  Theorem  to  find  c\  =  maxj  2i=i  \bij\  >  Xna^B)  and  d\  =  max^  Xrj=i  > 
Amai(A).  The  lower  bound  do  was  found  in  Step  1.  In  many  cases  B  heis  special  structure 
(such  as  diagonal  or  diagonally  dominant)  so  that  cq  can  be  found  analytically.  We  approxi¬ 
mate  7(A),  7(fi),  and  7(A-f-C5)  using  =  di/do,  vb  =  Ci/co,  and  =  (di -t-Cci)/(do  +  Cco)- 
These  7’s  and  r’s  are  upper  bounds  for  the  exact  values.  If 

=  ’>cu, 

then  the  heuristic  has  failed  and  we  set  S2  :=  0  and  s  :=  Si.  Otherwise,  the  heuristic  can  be 
solved  for  ^  by  setting  ^(A  +  (B)  =  cu  and  solving  for  (: 


c  = 


di  /cq  -  rdp/cp 
r  -  Vb 


,r  = 


<'c) 


(8) 


Finally,  if  A  -|-  ^B  does  not  satisfy  test  (2),  the  heuristic  fails  and  we  set  S2  :=  0  and  s  :=  Si. 


Let  A  and  B  be  partitioned  as  in  equation  (5).  The  following  Shift  Refinement  Algorithm 
uses  local  bisection  iteration  to  approximate  the  optimal  shift  S2. 

Shift  Refinement  Algorithm: 

•  Input:  Kmax,  and  (A,  B),  e  >  0  so  that  A  +  eB  satisfies  test  (2) 

•  Output:  0  <  <  <  e  so  that  A  +  tB  satisfies  test  (2) 
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Table  3:  Execution  Time  of  the  Parallel  Subspace  Iterations  on  iPSC/2.  (a  =  1.0) 


n 

P 

Tp  (seconds) 

K 

TxIT, 

11 

P 

Tp  (seconds) 

K 

T,/Tp 

1200 

1 

178.867 

5 

1200 

1 

297.537 

6 

2 

92.849 

5 

1.93 

2 

158.383 

6 

1,88 

II 

4 

48.798 

5 

3.67 

(m  =  10) 

4 

83.953 

6 

3.54 

8 

26.467 

5 

6.76 

8 

44.328 

6 

6.71 

16 

15.473 

5 

11.56 

16 

24.907 

6 

11.95 

3600 

1 

443.994 

4 

3600 

1 

763.193 

5 

2 

225.086 

4 

1.97 

2 

396.942 

5 

1.92 

(m  =  5) 

4 

116.258 

4 

3.82 

(m  =  10) 

4 

213.064 

5 

3.58 

8 

60.248 

4 

7..37 

8 

108.022 

5 

7.07 

16 

31.807 

4 

13.96 

16 

56.136 

5 

13.60 

Table  4:  Execution  Time  of  the  Parallel  Subspace  Iterations  on  iPSC/860.  (q  =  0.1) 


m 

n 

P 

Tp  (seconds) 

m 

n 

P 

Tp  (seconds) 

PDD 

PPT 

PDD 

PPT 

10 

3200 

1 

69.745 

69.745 

15 

6400 

2 

♦ 

♦ 

4 

20.856 

23.090 

4 

55.472 

61.219 

8 

10.775 

14.110 

8 

27.725 

32.583 

16 

10.744 

16 

14.342 

21.978 

32 

11.988 

32 

7.906 

♦ 

64 

♦ 

64 

4.679 

* 

*  Exceed  memory  capacity;  **  Shift  S2 's  used. 


•  In  parallel,  do  on  all  processors  P,,  i  =  1, . . .  ,p: 

1.  If  the  corresponding  off-diagonal  blocks  of  Z  for  Ai  passed  the  test  (2), 

set  ti  ;=  0; 
else 

initialize  f  1°^  :=  e,  :=  e/2;  for  A:  =  1, . . . ,  Kmax, 

/  u\ 

compute  corresponding  off-diagonal  blocks  of  Z  for  Ai  -}- 1]  Bi. 

If  the  test  (2)  is  satisfied,  :=  —  e/2*''''"’; 

otherwise  set  :=  -f  e/2*'''''’. 

Set  ti  A-  =  0, . . . ,  Kmax} 

endif 

2.  Following  a  global  communication,  get  t  in(ix{t,,  i  =  1, . . . 


In  Tables  3-6,  we  present  the  numerical  results  of  the  Parallel  Subspace  Iteration  algo¬ 
rithm  on  iPSC72  and  iPSC/860  multiproces.sors.  The  program  was  written  in  Intel  Fortran 
using  its  communication  library.  Some  Linpack  [8]  and  BLAS  [7]  Fortran  source  codes  were 


Table  5:  Execution  Time  of  the  Parallel  Subspace  Iterations  (n/p  =  85,  o  =  0.1) 


D 

PDD  with  S2 

PPT  without  S2 

H 

T7(iPSC7860) 

wm 

Tp  (iPSC/2) 

■a 

4 

7.391 

86.101 

6.879 

19 

8 

7.584 

112.811 

8.899 

19 

16 

7.729 

24 

166.732 

12.541 

19 

3-2 

■IIH 

7.954 

24 

20.489 

20 

used  on  the  nodes.  The  testing  matrices,  except  in  Table  6,  are  A  =  B  =  /„,  with 

1  if  0  <  |i  —  <  m 

‘2m  +  ai  if  i  =  j 
0  otherwise 

where  rv  roughly  equals  the  separation  of  eigenvalues.  T-p  is  the  execution  time  in  seconds 
using  p  processors  which  includes  the  time  spent  on  the  shift  selection,  and  K  is  the  number 
of  iterations.  The  number  of  required  eigenpairs  is  </  =  10.  The  error  tolerance  tol,  except  in 
Table  5,  is  chosen  such  that  the  eigenvalues  are  approximated  to  about  six-digit  accuracy.  In 
Tables  3  and  4,  excepting  the  case  marked  by  **,  all  testing  matrices  .4s,  passed  the  test  (2). 
Thus,  the  FDD  algorithm  is  applied  directly  without  any  extra  iterations.  For  comparison, 
in  Table  4,  the  results  of  using  the  PPT  algorithm  are  presented.  The  efficiencies  are  much 
higher  when  the  PDD  algorithm  is  applied,  especially  for  large  problem  sizes.  Table  4  shows 
that  if  the  number  of  processors  is  small,  the  size  of  a  problem  may  exceed  the  memory 
capacity  of  the  multiprocessors.  More  processors  must  be  employed.  However  as  the  number 
of  |)rocessors  increases,  the  order  of  the  resulting  reduced  system  may  become  too  large  for  a 
single  processor.  In  this  situation  the  PDD  algorithm  is  the  only  choice  for  solving  the  linear 
systems  involved  even  though  it  may  cause  more  iterations.  For  the  experiments  presented 
in  Tables  3  and  4,  the  order  ii  of  A  and  B  was  held  constant.  In  Table  5  experiments  are 
presented  in  which  the  order  of  the  local  submatrices  is  held  constant  at  -  =  85  and  the  order 

'  .  p 

ti  of  A  and  B  is  scaled  with  the  number  of  processors.  Here,  the  error  tolerance  was  also 
increased  to  nine-digit  accuracy.  The  half  band  width  is  rn  =  10.  These  results  demonstrate 
the  parallel  efficiency  of  the  shifting  strategy,  especially  on  a  large  number  of  processors. 
For  these  problems,  as  the  number  of  processors  increases,  the  efficiency  remains  unchanged 
when  .S2  is  used.  Without  it,  the  performance  of  the  algorithm  deteriorates  quickly.  The 
parallel  scalability  afforded  by  the  shifting  strategy  more  than  offsets  the  initial  overhead  in 
computing  S2  and  the  additional  number  of  iterations.  This  seems  to  suggest  that  as  long  as 
the  shift  .S2  is  relatively  small  and  introduces  a  moderate  number  of  iterations,  using  it  with 
the  PDD  algorithm  should  be  more  efficient  than  the  PPT  algorithm,  especially  for  large 
problems  on  a  large  number  of  processors.  When  the  shift  S2  is  too  large,  it  can  cause  a 
significant  increase  in  the  number  of  iterations.  Whether  the  shift  .S2  should  be  performed  is 
])roblem  and  machine  dependent,  and  an  a  prior  criterion  is  not  available.  Fortunately,  once 
the  shift  .S2  is  performed,  the  efficiency  of  the  shift  can  be  as.sessed  after  a  few  iterations. 

Assessing  the  efficiency  of  the  shift  .S2: 

Assuming  the  iteration  vector  has  reached  its  asymptotic  rate  of  convergence,  that 


Table  6:  Execution  Time  of  the  Parallel  Subspace  Iterations  (PPT) 


n 

P 

Tp  (seconds) 

Ti/Tp 

— 

A 

640 

1 

156.258 

12 

2.3  X  10-^ 

2 

78.719 

1.99 

11 

2.0  X  10-® 

4 

40.831 

3.83 

10 

5.1  X  10-® 

8 

25.815 

6.05 

10 

2.7  X  10-’’ 

16 

16.828 

9.29 

8 

1.4  X  10-® 

3600 

1 

796.953 

11 

3.4  X  10"® 

2 

422.973 

1.88 

11 

1.5  X  10-® 

4 

163.940 

4.86 

8 

1.2  X  10-® 

8 

75.556 

10.55 

7 

2.3  X  10-^ 

16 

43.429 

18.35 

7 

1.6  X  10-® 

is  (see  equation  (4)),  the  convergence  rates  of  the  two  approaches  (PDD 

with  shift  S2  or  PPT  without  shift  S2)  can  be  estimated  by  r^d  =  rj*)  and  Vpt  =  — 

,S2)/(A(*'^/^/t^  — 52))^  respectively.  Let  tol  =  The  numbers  of  iterations  for  these  two 

approaches  are  roughly  Udd  =  —dflogrdd  and  Upt  =  —dflogrpt.  The  timing  ratio  of  the  two 
approaches  is 


_  Time{PPT)  X  Upt  _  logr^j 

Time{PDD)  x  ridd  *'**’"*  ^  logrpt' 


If  Riime  “C  1,  the  shift  $2  should  be  abandoned  and  /I,,  used  on  In  this  case,  the  use 

of  shift  S2  may  cause  a  large  increase  in  the  number  of  iterations. 

Finally  we  present  interesting  numerical  results  for  an  eigenvalue  problem  of  a  simply 
supported  beam  from  structural  mechanics.  The  stiffness  matrix  A  is  positive  definite  with 
half  band  width  3,  while  the  mass  matrix  B  is  a  diagonal  matrix  with  a  wide  range  of  diagonal 
entries.  The  problem  comes  from  the  discretization  of  a  differential  system  by  central  finite 
differences.  Consequently,  the  largest  eigenvalue  of  {A,B)  is  0{n^).  Since 


^max  {A,B)<X 

max  {A)f Xmin^B')  ^  d] /A„j,>j (5), 


(  fa  di  >=  Amaxl^,  B)  •  Xmin{B)  ~  0{n^)  as  n  -+  00 

(see  equation  (8)),  the  shift  52  would  be  very  large  if  it  is  needed.  In  this  example,  matrix 
A  fails  test  (2)  and  the  shift  S2  is  too  large,  therefore  the  linear  solver  PPT  is  used.  Table 
6  shows  the  performance  of  the  algorithm.  It  is  interesting  to  note  that  when  the  number 
of  processors  increeises,  the  number  of  iterations  K  decreases.  Although  when  using  the 
PPT  with  a  large  number  of  processors,  the  Parallel  subspace  iterations  algorithm  generally 
has  low  eflSciency,  here  “super  linear”  speedups  are  observed.  In  this  example,  smaller 
subsystems  are  much  better  conditioned  than  larger  ones.  This  may  explain  the  decrease  in 
K  with  increasing  p.  The  answer  awaits  further  investigation. 
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4  Remarks  and  Conclusions 


The  parallel  algorithm  proposed  here  needs  little  more  storage  than  its  setpiential  version. 
Most  other  parallel  eigensolvers  recjuire  that  each  processor  have  direct  access  to  the  entire 
matrix  pencil  [A,  B),  while  in  our  algorithm,  the  matrix  pencil  (A,  B)  and  the  iterative 
vectors  are  divided  evenly  into  blocks,  which  are  allocated  to  corresj)onding  processors. 
Each  processor  operates  only  on  its  own  blocks  of  {A,  B)  and  most  of  the  time.  All 
processors  solve  identical  subproblems  and  communicate  the  same  amount  of  data  at  each 
step  of  the  computation.  The  load  is  perfectly  balanced.  The  shift  .>2  reduces  data  references 
between  processors  and  greatly  increases  the  parallel  efficiencies  in  some  situations.  When 
the  number  of  processors  is  large,  secondary  memory  is  usually  not  necessary  even  for  large 
l)roblems.  Data  transfer  between  different  levels  of  memory  can  be  reduced  by  employing 
block  matrix  computations,  such  as  BLAS  [1]  [7].  When  the  number  of  processors  or  the 
band  width  is  large,  the  size  of  the  reduced  system  Zy  =  h  can  become  prohibitive  for 
the  PPT  algorithm  and  most  other  banded  solvers.  In  this  situation,  shift  with  its 
possibly  high  number  of  iterations,  seems  to  be  the  only  alternative.  The  efficiency  of 
the  shift  S2  is  machine  dependent.  Our  numerical  results  suggested  that,  on  iPSC/’^ 
iPSC/SfiO  hypercube  multiprocessors,  this  shift  can  lead  to  higher  efficiencies  when  the  ratio 
of  largest  to  smallest  eigenvalues  of  the  pencil  is  0(n),  but  it  is  detrimental  when  this 
ratio  is  O(n^)  or  greater,  which  is  typically  the  case  with  problems  derived  from  differential 
equations.  However,  in  this  case,  the  numerical  experiments  of  Table  6  indicate  that  the 
PPT  version  of  our  algorithm  exhibits  “superlinear”  convergence.  This  may  occur  because 
smaller  submatrices  are  much  better  conditioned  than  larger  ones. 

Many  acceleration  schemes  have  been  applied  to  the  basic  sequential  subspace  iteration 
method  [3]  [26],  making  it  efficient  for  a  wide  variety  of  applications.  Most  of  these  acceler¬ 
ating  schemes  can  be  easily  incorporated  into  our  parallel  algorithm. 
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